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Abstract 

A recently developed statistical mechanical Quasi-Chemical Theory (QCT) has led to significant 
insights into solvation phenomena for both hydrophilic and hydrophobic solutes. The QCT exactly 
partitions solvation free energies into three components: 1) inner-shell chemical, 2) outer-shell 
packing, and 3) outer-shell long-ranged contributions. In this paper, we discuss efficient methods 
for computing each of the three parts of the free energy. A Bayesian estimation approach is 
developed to compute the inner-shell chemical and outer-shell packing contributions. We derive 
upper and lower bounds on the outer-shell long-ranged portion of the free energy by expressing 
this component in two equivalent ways. Local, high energy contacts between solute and solvent are 
eliminated by spatial conditioning in this free energy piece, leading to near-Gaussian distributions 
of solute-solvent interactions energies. Thus, the average of the two mean-field bounds yields an 
accurate and efficient free energy estimate. Aqueous solvation free energy results are presented 
for several solutes, including methane, perfiuoromethane, water, and the sodium and chloride ions. 
The results demonstrate the accuracy and efficiency of the methods. The approach should prove 
useful in computing solvation free energies in inhomogeneous, restricted environments. 

PACS numbers: 82.60.Lf,87.16.A-,61.20.Ja,64.70.qd,64.75.Bc 
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I. INTRODUCTION 



Free energy calculations are a central challenge in modern statistical mechanics.^ Ac- 
curate evaluations of free energies are crucial for understanding a wide range of chem- 
ical and physical phenomena, including chemical reactions in solution ,-1^11^1^'^ solvation 
free energies,— ^'^'^'^'^'^ potentials of mean force between complex molecular species in 
water,— acid-base equilibria, l^'AS ligand/drug binding interactions,-i^'i^'i^'i2i'^ and phase 
equilibria.— Free energies are composed of enthalpic and entropic pieces, both of which may 
be more difficult to estimate than the free energy itself during computer simulations.->2^ 
Due to the high sensitivity of computed free energies, enthalpies, and entropies to force field 
variations, and the possibility of direct and accurate experimental testing of the computed 
results, calculations of these thermodynamic quantities provide a proving ground for the 
development of viable molecular models.— i^^i^iiS^ 

A wide range of theoretical and numerical methods has been developed for assessing free 
energy changes.— The standard toolbox of free energy calculations includes the thermody- 
namic integration (TI) and free energy perturbation (FEP) techniques. The theory behind 
these methods was developed long ago by Kirkwood,^'' Landau and Lifshitz,— Zwanzig,— and 
others. More recent developments have included the histogram overlap,— Bennett accep- 
tance ratio (BAR),— importance (or umbrella) sampling,— adaptive biasing force (ABF),— 
weighted histogram (WHAM),— replica exchange,— resolution exchange,— metadynamics,— 
and non-equilibrium work methods.—"^ Ref. Q offers a thorough overview of the fundamen- 
tals of these and other modern approaches to free energy calculations. 

Often a coupling parameter is introduced that, when changed from zero to one, grad- 
ually links the initial and final states along some path. This partitioning of the problem 
along a coupling path is typically referred to as stratification or staging. The FEP ap- 
proach samples the exponential of the energy difference between two steps in an alchemical 
transformation,— while TI samples the analytic derivative of the Hamiltonian with respect 
to the coupling parameter for a given value of that parameter.^ A key feature of these 
traditional free energy methods is that there must be robust sampling of the phase spaces 
along the coupling parameter transition; the smaller are the coupling parameter increments, 
the larger are the overlaps of the phase spaces sampled throughout the transition. Extensive 
work has gone into optimizing algorithms to enhance these phase space overlaps.— '^'^'^'^ 
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In the present paper, we develop an alternative approach based on a spatial stratification 
of the free energy contributions, and apply the methods to the computation of absolute 
hydration free energies. 

As suggested above, FEP calculations invariably encounter the same fundamental dif- 
ficulty, namely that of sampling the distribution of the energy change.-"^ For the case of 
solvation free energy considered here, and without any stratification, this energy change is 
the interaction energy between the solvent and its environment. Taking the exponent of this 
quantity, however, makes the estimate highly sensitive to the contributions from tail regions 
of the interaction energy probability distribution. The probability of sampling these tail 
regions during standard simulations may be low, leading to poor statistics, and results from 
naive applications of the exponential averaging formula may thus incur substantial errors. 
In addition, the averaging contains a systematic bias.— i^i^ 

The above point can be illustrated with the histogram overlap method. Fig. [Tl— That 
method takes advantage of data from two simulations (initial and final states) by noting that 
the logarithm of the ratio of these energy distributions is related to the free energy difference. 
We display the interaction energy distributions for the coupled and uncoupled limits for 
methane solvation in water. In the figure, it can be seen that the probability distribution 
of the solute-solvent interaction energy undergoes significant changes during the solvation 
process. Those configurations whose energies are most important for characterizing the 
solvated state occupy only a small fraction of the total configurational space of the uncoupled 
system. For the displayed case, it is clear that, even though methane is a reasonably small- 
sized molecule, large simulation times are required to adequately sample the low-energy tail 
of the uncoupled distribution (as it occupies only 0.2% of the sample space). An exponential 
averaging formula does not work for the fully coupled case, since the high-energy, exponential 
tail is not adequately sampled during simulations; direct implementation of the exponential 
averaging formula leads to a 16 kJ/mol error in the free energy. Larger cases such as CF4 
become even more problematic,-^ and no detectable overlap would occur for a small peptide, 
for example. 

Recognizing this fundamental difficulty, some free energy methods attempt to circumvent 
the overlap problem by a reformulation designed to minimize the error. Stratification,^° im- 
portance sampling,— and the multicanonical,— BAR,— and WHAM— methods are efforts 
in that direction. Utilization of Tsallis statistics^"^ is another possibility in which the 
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canonical probability distribution is re-expressed in a form that has the Boltzmann distri- 
bution as a particular limit, effectively smoothing out the potential surface. This allows the 
higher-energy regions to be reached more frequently and can improve the sampling of the 
tail regions at the expense of less detailed exploration of local minima. 

Other approaches include cumulant expansions of the free energy^ and high-order TI 
formulas.— A cumulant expansion is a special case of fitting the energy probability distri- 
bution to an approximate analytic form. Cumulant expansions to second order may be 
particularly effective in some circumstances, since the two-moment formula is exact in the 
case that the energy is Gaussian-distributed. Combinations of accurate, high-order TI inte- 
gration formulas with cumulant expansions (to obtain derivatives in a Taylor expansion of 
the free energy) have led to very efficient free energy estimates.— 

The Potential Distribution Theorem (PDT) provides another view of solvation thermo- 
dynamics that can be helpfully employed in free energy calculations.—'^*^ It results in a 
near-local partition function expression for the excess chemical potential. That excess chem- 
ical potential is the solvation free energy sought here. Following the original formulation 
by Widom,— the PDT has found wide application in models of small molecule solvation via 
the test particle method. More recently, the PDT has been further expanded as a general 
statistical mechanical approach for molecular liquids.^'' Part of that conceptual expansion 
has involved the development of a quasi-chemical theory (QCT)^'^'^'^'^'^ that exactly 
partitions the free energy into inner-shell (IS) and outer-shell (OS) components. The ex- 
act expression can then be manipulated to make physical approximations for the IS and 
OS parts. The OS portion of the free energy can be further decomposed into packing and 
long-ranged contributions.—*^ The term 'long-ranged' used in this context indicates the free 
energy for interaction of the solute with the solvent, with solvent molecules excluded from 
the inner-shell region. 

The spatial partitioning may be utilized to perform, for example, ab initio calculations 
for the IS domain, while making classical mechanical or even continuum approximations 
for the OS region.—*^ The QCT has been found especially effective for understanding ion 
solvation phenomena,— i^i^i^i^ leading to highly accurate estimates of the free energies. 
Recently, quasi-chemical theory has been further applied to the calculation of the aqueous 
solvation free energies of water— and small nonpolar molecules such as perfluoromethane 
(CF^)^ and methane.— These studies have found that the absence of close-contact repulsion 
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in the OS long-ranged part of the free energy induced by the QCT partitioning leads to 
near-Gaussian energy distributions. In this paper we exploit that observation to develop 
efficient simulation methods for computing this portion of the free energy. A novel statistical 
procedure for estimation of the IS and OS packing contributions is also presented, leading 
to a robust procedure for calculating absolute solvation free energies directly from molecular 
dynamics sampling. 

The structure of the paper is as follows. Section [TTl introduces the theoretical motivation 
behind the QCT and the sampling strategies employed, and derives bounds on the OS long- 
ranged part of the free energy. Section [Till develops a Bayesian approach for computing the 
IS free energy and the OS packing contribution. Section IIVI illustrates how a continuous 
model repulsive potential can be included to allow for molecular dynamics sampling in 
the free energy computations, and Section |V] discusses details of the simulations reported 
here. Section |Vl] presents results of absolute solvation free energy calculations for methane, 
perfluoromethane, TIP3P water, and the sodium and chloride ions. The final section presents 
our conclusions and discusses potential applications to more advanced free energy problems. 



II. QUASI-CHEMICAL THEORY 

In this section, we review the Potential Distribution Theorem (PDT) and its quasi- 
chemical theory (QCT) form to set the stage for developing efficient computational ap- 



proaches for calculating the free energy via spatial partitioning; see Refs. 
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49 for further 



discussion. The PDT expression for the chemical potential of chemical species is 

= In [p^{r)Al/q^'] - ln((exp(-/5Af/,)|r))o. (1) 

We typically express the chemical potentials in unitless form, Pfia, where (3 = l/ksT {ks 
is the Boltzmann constant and T is the temperature). The first term on the rhs is the 
ideal chemical potential involving the molecular center-of-mass number density Po(r), the 
thermal deBroglie wavelength A^, and the gas phase partition function g™* for the internal 
states of the molecule (vibrations, rotations, and electronic) . The second term is the excess 
chemical potential evaluated at the center-of-mass point r. The double brackets with zero 
subscript imply that the solvent molecules and the solute are sampled independently; 
then the two independent subsystems are superimposed and the Boltzmann factor of the 
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interaction energy (At/a = Un+i — Un) is averaged. It is clear that, if the solute molecule is 
large, in practice most superimposed subsystems will lead to significant overlap of the hard 
molecular cores, even though the formula is rigorously correct. This is one motivation for 
the development of the partitioning scheme of Quasi- Chemical Theory {QCT.)^^^^ 

In this paper, the chemical potential sought is for a single solute in dilute solution, so 
no a label is necessary. We consider solute molecules either of the united atom type or 
with fixed internal coordinates, dissolved in a homogeneous liquid (water). Thus, we also 
do not need to consider the sampling of the solute internal coordinates, and we can drop 
the position r. The PDT then reduces to 

= In [pA^] - ln(exp(-/5Af/))o. (2) 

Our focus will be on the excess chemical potential, that is the contribution over and 
above the ideal term: 

/?/i^^ = -ln(exp(-/3A[/))o. (3) 
The average on the right hand side can be expressed as 

where [/at represents the interaction energy for the solvent molecules, and AU is the 
interaction energy of the solute with the solvent. The above notation is a schematic in 
which the integration over all "iN solvent molecular degrees of freedom is represented by 
/ dx^ while the solute center-of-mass is held fixed. 
Alternatively, we can express Eq. (jl]) as 



e 



= e-^^^- = j deP^'\e)e-^% (5) 



where 

pW(e) = (5(e-Af/))o (6) 

is the distribution of the interaction energies in the uncoupled case. 
Similarly, we can say 

{e^^U) = e^^'"" = j deP{e)e^', (7) 

where 

P{e) = {6{e-AU)). (8) 



The lack of a zero subscript for the average in Eq. ([8]) imphes that the solute is included in 
the sampling. This is the inverse form of the PDT. There is a simple relationship between 
the coupled and uncoupled binding energy distributions:— 

p(,) = e-/3(e-/.-)p(o)^^)_ (9) 

Thus, if one of the two distributions were to assume a Gaussian form, then so would the 
other. At another extreme, if one distribution were roughly constant, then the other would 
be near-exponential. 

We can partition the excess chemical potential by inserting unity into Eq. (jlj) in the form 



(10) 



where AU-nsW is a hard-sphere potential and A is the distance from the center of the hard 
sphere to the center of the nearest solvent molecule. The radius A divides the system into 
inner-shell (IS) and outer-shell (OS) domains. It is typically chosen so that the solute/solvent 
radial distribution function at that distance has appreciable nonzero values, so as to include 
some solvent molecules during thermal sampling; distances out to the first minimum in the 
radial distribution function might be considered. Thus the IS domain includes the solute 
and some solvent molecules directly interacting with the solute, and the OS region includes 
all of the rest of the solvent. 
After a slight rearrangement 

(e-'^^^)o = ^(e-^^^)A. (11) 

Here 

Po(A) = (e-^^^-W)o, (12) 

and 

a;o(A) = (e-'3^^Hs(A)^_ (13) 

The sampling leading to the average labeled by A in Eq. (ITT!) includes the hard particle with 
size specified by A. Also, the solute is included in the sampling that yields a;o(A): 

xo{X) = (e-^^^Hs(A)^ ^ J dx^e-^^^e-^^^ " ^^'^ 

The quantity that is averaged in Eq. (fT^ . exp[—pAUus{X)], is an 'indicator function' that 
is one if there are no overlaps with solvent molecules and zero if overlaps exist.— The term 



xo(A) is the probability of observing no solvent centers within the IS region while the solute 
molecule is situated at the center of the of the inner shell. The term po(A) is the probability 
that no solvent centers are located within the IS domain with no solute in that domain. 
Thus the excess chemical potential can be written as 

Pfi'^ = Inxo(A) - Inpo(A) - ln(e-^^^)A. (15) 

It is important to note that this spatial partitioning of the free energy is exact; the intro- 
duction of a hard-sphere particle is an artifice to enact the partitioning. The result for the 
excess chemical potential is independent of the choice of A, as we will see in our modeling 
results, and has been observed in previous calculations.— There may be judicious choices for 
A, though, that can lead to helpful approximations that enhance the efficiency and utility 
of the theory. 

The first term on the rhs of Eq. (JTSil is the IS contribution to the excess chemical potential, 
/3/ijg(A). It corresponds to minus the work necessary to move the nearby solvent molecules 
from direct contact with the solute to outside the IS domain. Thus, this term can also be 
viewed as a chemical contribution that is, roughly speaking, the free energy to bind the 
nearest solvent molecules to the solute. The second term is the work to grow in a cavity 
of size A in the solvent. This is the OS packing contribution /^/^os hs('^)- '^^^ final term is 
the free energy of interaction of the solute with the solvent with all of the solvent molecules 
expelled from the IS region by inclusion of the hard sphere in the sampling, /S/xg'g lr(A), 
where 

We can view this expression in a different way: 

{e~'^^^)x = 



Jdx^e-^^^ J dx^e-^u^e-^^u^sW 



^g-/3AC/Hs(A)g-/3A(7^^ 







(e-/3AC7Hs(A))^ 



Po(A) 

P^'^\e, Tinin > A) is the joint probability of observing the coupling energy e and the condition 
'"min > A, where r^i^ is the distance of closest approach of the solvent molecules to the solute. 
But the ratio of probabilities in Eq. f[T7l) is simply the conditional probability P'-'^^(e|rjnin > 
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A), so 

e-^^os,LRW = {e-^^u^x = j deP^'\e\r^^ > A)e-^^ (18) 

where the label LR is for the long-ranged contribution to the free energy. Thus, instead of 
inserting the HS solute in the sampling, we could sample the uncoupled system and average 
the distribution of interaction energies conditioned on the absence of solvent molecules in 
the IS region. 

We can re-express the OS long-ranged contribution in a form that includes the coupled 
sampling by a rearrangement: 

)^ ^ J ^3,Afg-/3C/jvg-/3Al/Hs(A) 

/ c;a;^e-^^^e-^^^Hs(A)g-/3A(7 

^ J (^a;Afg-/3C/ivg-/3A[/Hs(A)g-/3A(7g/3Ai7 
h+AU 



^'^""Uau- (19) 



The averaging in the last step occurs for a system containing a fully interacting solute 
particle in addition to a hard-core particle of size A. Then the excess chemical potential in 
Eq. (fT5|) can also be written as 

(311^^ = InxoiX) - Inpo(A) + ln(e^^^)A+A[/. (20) 
The averaging in the last term is equivalent to 

e''os,.nW = (e/^^^),^^^ = I deP{eK,^ > A)e^^ (21) 

We note that a relation similar to Eq. iQ holds here with the distributions replaced by the 
conditioned ones, and the excess chemical potential replaced by the long-ranged contribution. 
The approach of Eq. (^T^ has been taken by Pratt and coworkers.— i^i^ It avoids issues of 
de- wetting/re- wetting which may occur if a hard particle reference system is used, and the 
attractive interactions are subsequently turned on. The averaging that leads to P{e\r^ui > A) 
involves the real solute in the sampling, and requires the observation of configurations in 
which no solvent molecules are located in the IS region. For small cavity sizes, this approach 
is feasible, but observations of spontaneous larger cavities in water are rare. 

Shah, et al.,— Asthagiri, et al.,— and Asthagiri, et ai— observed that the conditioning 
inherent in Eq. (pTj) leads to a near-Gaussian form for the distribution of interaction energies. 
This is because moving the solvent molecules some distance from the solute leads to many 



similar interactions smaller in magnitude, and the interactions do not include large-deviation 
(repulsive) overlaps with the solute. With no conditioning, the distributions tend more 
toward an extreme-value (exponential) form due to the overlaps of the solute with nearby 
solvent molecules. 

So we can view the conditioning in one of two ways: either include the hard particle 
directly in the sampling, or average the distributions only over configurations that exhibit no 
penetration of the solvent molecules into the inner shell. The two approaches are equivalent, 
but the former approach allows for more efficient sampling, especially if the IS radius becomes 
large. In both cases, the effect of pushing the solvent molecules out of the IS domain leads 
to the same, approximately Gaussian interaction energy distributions. 

Due to the near-Gaussian form for the distributions, a second-order cumulant expansion 
for the OS long-ranged contribution is appropriate. The expansion from Eq. f|T5|) yields 

/9/^os,lr(A) = - ln(e-^^^), ^ /5(Af/), - ^ [{AU'), - {AU)l] +... (22) 

Similarly, Eq. fl20|) results in 

/3/^os,lr(A) = ln(e^^^),+Af/ ^ (3{AU),^au + y [{AU'),+au - (At/)tAt/] + • • • (23) 

Thus the mean-field {AU)x and (A[/)a+ai/ are upper and lower bounds, respectively, on the 
long-ranged part of the free energy by the Gibbs-Bogoliubov inequality.— '^'^'^'^'^ If the 
interaction energy distribution for either the coupled or the uncoupled parts is approximately 
Gaussian, then the other is also near-Gaussian (with a similar width), and adding the 
two expressions leads to a near-cancellation of the fiuctuation terms. Hence, a sensible 
approximation to test would be 

/?/ios,LR(A) ^ f [{AU), + {AU),^Au] . (24) 

This formula is similar to one of the thermodynamic integration formulas developed by 
Hummer and Szabo,^'* but is obtained in a different context here. 

III. BAYESIAN ESTIMATES OF OUTER-SHELL PACKING AND INNER- 
SHELL CHEMICAL CONTRIBUTIONS 

In this section, we develop a Bayesian approacb^*^ for estimating both the //qs hs(A) and 
/xfg(A) contributions to the excess chemical potential. This method generalizes a discussion 
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of spatial stratification outlined in Ref. |47| (Ch. 5). The probabilities for the OS packing and 
IS chemical contributions, Eqs. f|T2l) and f|T3|) . relate to the probability that a defined region 
of space is unoccupied, without or with the solute present, respectively. The occupancy 
condition can be reduced to monitoring the distance from the solute to the center of the 
nearest solvent molecule rmin- This creates a one-to-one relationship between the probability 
distribution of the closest solvent molecule P (rmin) and the probability that determines the 
above chemical potentials P {r^in > Aj): 

p{Xi) = P (rmin > Xi) = P (rmin) dr^in = 1 " / P ('"min) rf^^min- (25) 

J\i Jo 

This means that calculating the chemical potential profile for different values of Aj reduces 
to a problem of accurately estimating the distribution of rmin- 

To exhibit the physical background of the above discussion, we first review the relation 
of the distribution of closest approach to the scaled particle theory (SPT) for the packing 
part of the free energy. An integral equation for the distribution P (rmin) can be derived by 
noting that this distribution is the probability of observing a cavity of size rmin O'^d a solvent 
particle in the range rmin — ^ '"min + c^'^min (see Fig. [2]). This is just the conditional probability 
of observing a solvent particle in the window given an rmin-sized cavity {4:nr'^^^^G{r^i^)p) 
times the cavity probability ([l — /q*^""" P {s)ds\), where G'(rniin) is the contact value of the 
radial distribution function for the hard sphere interacting with the solvent, and p is the 
solvent density: 

Z*^ mill 

P (rmin) = 47rr^mpG'(rmin) 1 - / P is)ds , (26) 
The above integral equation can be solved analytically, yielding 

- J Ans^pG{s)ds\ . (27) 

When this expression is inserted into Eq. (!25|) for P (r^in > Aj), the resulting excess chemical 
potential is 

/9/Xos,Hs(A) = / 47rrVG(r)rfr, (28) 
which is just the SPT expression.— If G{r) = 1 in the above equation (ideal gas case), we 
obtain the Hertz distribution for P (rmin) with the excess chemical potential PPosnsW ~ 
47rA'^p/3.— Instead of solving directly for the contact value G{r), here we develop a 
Bayesian estimation approach for obtaining p{Xi) in Eq. (!25l) from occupancy statistics data 
(in discretized spatial shells) assembled during the simulations. 
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In order to calculate the OS packing and IS chemical potential contributions, Eqs. (fT2|) 
and f|T3|) . we conduct two sets of simulations with the solute having either a zero interaction 
or a full interaction with the solvent, respectively. Thus in this section we will make the 
assignment P (rmin > ^i) = Pi, without reference to the particular choice of sampling (no 
solute, with solute, etc.). It should be kept in mind that the Pq term in the OS packing 
contribution /5/ios Hsl-^*) — ~liiPo(^min > Aj) is not the i = limit case of pi. Rather, 
Po ('"min > Aj) = Pi for the case of no solute present in the sampling. When calculating the 
IS probability, then P^u {r^in > K) = Pi, and the solute is included in the sampling. 

Assume we have collected data from a set of simulations including hard-core solutes of 
varying sizes, \j, indexed in ascending order by j = 0, 1, . . . , L — 1, and we are interested 
in calculating the free energy Inp^ for some arbitrary distance > A^^i. We discretize 
the problem by separating the possible values of rmin into shells around the solute (or cavity 
center) as in Fig. [31 This creates a set of mutually independent and exhaustive events, whose 
probabilities are 

Sk = P (rmin G (Afc, Afc+i]) , (29) 

where the A values below Al must coincide with the set of hard-core sizes included in the 
simulations. The data gathered from the j^^ simulation (which included a hard-core particle 
of size Xj) consists of accumulating the counts Xj{k) of occupancy events for the nearest 
solvent molecule in the shells labeled by k. 

The likelihood of such a set of observations from the j^^ hard-sphere simulation radius is 
the ratio of two multinomial functions: 

Fi{x,{k)}\N„{s,},\,) 

Fi{xjik),k = j,j + l,...,L}\Nj,{sk}) 
P {{x,{k) = 0; A; = 0, 1, . . . , J - 1}\N„ {sj,}) 

_ N,\ut=As,/p,r^'^ 

= Aix)p-''^sf'^l[sf'\ (30) 

k=j 

where the first equality expresses the conditional probability as the joint probability of the 
observed occupancies and an existing cavity divided by the probability of the observation 
of the cavity; the further equalities follow from inserting the multinomial expressions. Here 
Nj is the total number of observations from the simulation and A is some function of x in 
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which we are not interested, since our final goal is the posterior probability of {sk} (and 
thus Pi). 

The prior chosen here is that of Haldane and Zellner, i.e. an improper Dirichlet distribu- 
tion with zero initial observations:— 

L 

P(iV,,K},A,)ocn^."'- (31) 

k=0 

This noninformative prior is a commonly used one for the multinomial case; it emphasizes 
the limits of the occupancy distributions, resulting in a rapid convergence to the final pos- 
terior distribution with increased sampling.— Other choices could have been made, such as 
incorporating smoothness in the profile. The effect of the prior distribution diminishes, 
however, as more observations are gathered.— 

The total likelihood function is a product of likelihoods from all simulations. The posterior 
distribution for the shell probabilities {sk} is calculated by multiplying the data likelihood 
with the prior distribution and normalizing: 

L-l 

P {{sk}\Nx) oc s^-'l[p-''^st\ (32) 

This procedure gives a complicated function of shell probabilities owing to the division by 
Pj = 1 — X]i=o ^k- Since we are interested in deriving a posterior probability distribution that 
aids in estimating the logarithm of pl, it is useful to carry out a transformation of variables 
from the shell probabilities to conditional increments {6i = P (rmin > Aj|rmin > K-i)}- These 
conditional probabilities are closely related to the intensity function of Gumbel,— and they 
refer to the probability of observing no solvent particles between Ai_i and Aj, given that 
the region inside Ai_i is empty. In terms of these new variables, the logarithm of pi is 
additive (since pj = Y[i=i^i)- To connect with the SPT discussion above, we note that 
Si = pi{l — The mapping here is that Si is the probability of closest approach, pi is the 

cavity probability, and (1 — is the conditional probability of finding a solvent molecule 
in the shell given the existence of the cavity. Thus we can note that the 47rr^j^ appearing 
in Eq. fl2^ is carried by (1 — Then the Jacobian for the transformation is Y[i=i 

and the posterior becomes: 

L-l 

P m\Nx) (X pi--' n 5f-M'-^'-'(l - 5:+if-\ (33) 

13 



where we have defined bin totals as 

k 

C, = Y^Xj{k) k = 0,l,...,L-l (34) 

L-1 L-1 L-1 

j=0 j=0 k=0 

This transformation of variables has thus led us to a simple form for the posterior dis- 
tribution. Because the posterior of Eq. ([2SD is a product and does not contain cross-terms 
between the Si, it shows that our estimate for each 6i follows an independent beta distribution 

P {6.\Nx) = f^^^Sr\l - S,f^'\ (35) 

with parameters 

i-l i-1 
j=Q k=0 

A = 0-1- (37) 

This is what we might have suspected from the start, since the number of counts in bin 
k should really be combined from all simulations that can observe such an event. The total 
number of observations used for determining each increment, (ai + jSi), is therefore all counts 
from these simulations {Xj < Aj) in shells k > (i — 1), i.e. a sub-block of Table [T] including the 
lower left corner. The above discussion also warns us that the probability for any increment 
will be more uncertain if either or jSi is small (corresponding to no counts above Aj or no 
counts inside (Aj_i,Ai], respectively). 

The free energy is constructed by adding together the incremental free energies: 

L 

Hpl) = Y,H6,). (38) 

i=l 

We derive the Bayesian minimum mean-squared error estimator— by integrating over the 
posterior pdf: 

E(ln5) = ^;§±gy J d6 ^--^(l - 5)^-4n 5 

= V'o(a)-^o(a + /5). (39) 
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Here ipn{z) is the polygamma function.— A similar integral can be derived for the variance 
(below). Since the the means and variances of independent random variables are additive 
under summation, the Bayesian estimates for the free energy and its variance are then 

L 

E (Hpl)) = 5^ V^o(ai) - Mc^i + A) (40) 

i=l 
L 

E{{ln{pL)-E{ln{pL))f) = ^ ^,(a,) - ^^(a, + /?,). (41) 

i=l 

The simplest way to carry out the whole calculation is to tabulate Xj{k) as in Table [T] 
and compute partial sums over sub-blocks to obtain a set of L ai and (3i values. This 
process must be repeated for each choice of at which the inner-shell or outer-shell packing 
excess chemical potential value is to be estimated. We used the scipy special python library 
implementation of the polygamma function for the calculations in this paper.— At the end 
of the next section, we discuss how this Bayesian approach to occupancy statistics can be 
implemented using molecular dynamics simulation data. 

IV. INCLUSION OF A CONTINUOUS MODEL POTENTIAL FOR MOLECU- 
LAR DYNAMICS SAMPLING 

Our goal is to implement the approach outlined above in large-scale molecular dynamics 
simulations. In order to carry out the indicated averages involving hard-sphere particles, we 
can employ a change of system energy function to one including a repulsive but continuous 
model potential, M(r). The model potential is chosen such that the hard-sphere condition 
is likely satisfied. A helpful expression results from the PDT formula for averages:— 

^g-/3AC/Hs(A) 

(F)x+M = /„-/3AC/hs(A)\ ^ (-^kmin > -^)Af " (^2) 
\^ /M 

With this approach the averaging with the hard-sphere and model potentials included is 
re-written as a conditional average with only the model potential in the sampling. Since the 
repulsive model potential is chosen to closely mimic the hard-sphere potential, relatively few 
configurations are discarded, and good statistics can be obtained. 

We first focus on the long-ranged OS contribution to the excess chemical potential, 
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LR- We re-express that contribution as 



/ c/a;^e"^'^^e"'^^^Hs(A)g-/3Ai/ 

J dx^Q-PUNQ-fiAUHsW 

/ -/3[AC/-A/]\ 
_ \^ /A+M 

~ /p/3M\ 

/A+M 

/ -/3[AC/-M] I > X \ 

~ /p/9M 1^ . > \\ 

\c I ' mm ^ jv/ 

or in the inverse form 



(43) 



e' 



/ c;a;^e"^^^e"'^^'^Hs(A)g-/3AC/g/3A(7 

J (lrf.NQ-l3UNQ-P^UHsWQ-fiAU 

/ f5[AU+M]\ 

\y / \+M+AU 

\^ I \+M+AU 

(e^I^^+^1 \r^^ > A) 



M+AU 



(el3M \r ■ ^ \) ' ^^^^ 

\c I 'mm ^ ^/M+AC/ 

Cumulant expansions can be performed on Eqs. (I43l) and (l44l) . leading to the following 
relations (to second order in /3): 

3^ 13'^ 
< /3(Af/|r^i„ > A)^, + ^Vmm (45) 

and 

/3/^os,LR ~ /9(At/| > A)^,^^j, + ^ (ly^^t; + PFmm + 2Wum) - Y^MM 

> P{AU\r^,^ > A),,+^^ - ^Wmm. (46) 
Here Vuu, Vmm, Wuu and Wmm are variances: 



Vc/t/ = 




(47) 


Vmm = 




(48) 


Wuu = 


(Af/Vmin > A)^,^^^ - (Af/|r^i„ > A)^,+^j, 


(49) 


Wmm = 




(50) 



and VuM and Wum are covariances: 

VuM = (Af/M|r^in > A)^ - (Af/|r^in > A)^,,^ (M|r^i„ > A)^^^ (51) 
VTc/M = (At/M|r^in > A)^^^^ - (At/|r^in > A)^^^^^ (M|r^i, > \) m+au ■ (^2) 

16 



In this way, bounds can still be recovered from the model potential simulations. For the rest 
of the discussion, the M-sampled system will be referred to as the reference system, and 
the (M + Af/)-sampled system will be referred to as the coupled system; the fully coupled 
system is recovered when no hard-sphere condition or model potential is applied in a coupled 
simulation. 

The conditional variances of the model potential (Vmm and Wmm) appear as extra quan- 
tities widening the bounds. The best choice of model potential would therefore minimize 
these variances as much as possible. To accomplish this, note that for any model potential 
that is zero outside the chosen hard sphere radius, Vmm and Wmm become identically zero. 
On the other hand, if the model potential has a repulsive wall well inside the hard sphere 
radius, many configurations will violate the hard-sphere condition and the simulations will 
not yield good statistics. Since the aim of introducing the model potential is to allow molec- 
ular dynamics sampling, it must be continuous and have a derivative that does not make 
the system numerically unstable with the chosen time-step. These considerations suggest 
choosing M(r) to be as steep as possible at the chosen hard-sphere boundary. 

For the calculations performed here, a Lennard- Jones potential with Weeks-Chandler- 
Anderson (WCA)''^ style truncation at the minimum was used for M(r). The coefficients 
were chosen to make the model potential's value at the hard sphere boundary equal to a 
defined constant Ex and to force the minimum to coincide with a given cutoff i?c,A > 

M{r) = I ' 

[0 r > i?e,A 

_ 2ci2 _ Ex 

Ce — 7^fi~; Ci2 — — 2- 

RU i^-'-RcS 

Initial simulations used a value of ksT /2 (where fc^ is the Boltzmann constant and T is 
the temperature) for Ex, and the choice i?c,A = A + O.lA for the cutoff radius. We noticed, 
however, that more configurations were rejected at larger model potential radii and from 
the coupled systems due to violations of the hard sphere conditions, giving worse sampling 
for these cases. Various alterations of the model potential were explored to enhance the 
number of accepted configurations. For coupled simulations of Na'*' and Cl~ and large-A 
simulations of CH4 and CF4, we used a new choice for the cutoff radius, i?c,A = 1.05A, but 
the ion simulations continued to exhibit low acceptance probabilities. 
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(53) 



We then pursued an alternative strategy of setting the M(r) parameters to fixed values 
and altering the HS size A downward, based on the distributions of solvent closest approach, 
until substantial acceptance probabilities (on the order of 80%) were obtained. Similar 
alterations were also applied for the largest A simulations of TIP3P water. A sequence of 
M(r) potentials with increasing effective particle size, followed by adjustment of the HS size 
A to yield sizeable acceptance probabilities, thus appears a sensible procedure for generating 
the data for a range of A's. 

The origin of the problem of determining the M (r) form is that the accessible volume for 
penetration of solvent molecules inside the specified hard-sphere A increases for larger radii. 
Future work should explore alternative model potentials that are harsher near the A radius. 
We note, however, that the size range considered in our simulations (below) is already quite 
large for the cases of interest, and the hard cores of larger molecules can be considered as 
additive contributions from multiple hard spheres with typical sizes on the order of those 
examined here. 

In the present study, free energy bounds on the OS long-ranged part of the free energy 
were obtained as discussed above. In order to obtain an accurate estimate of /x^g lr based on 
Eq. fl2^ . a re- weighting strategy was employed, in which the data from the model potential 
simulation yields an estimate of the mean-field energy with the HS particle included in the 
sampling: 



(AC/). = 



(Af/e^^^l 






^min ^ A) j^j 



(54) 



A similar equation holds for the coupled case. This re-weighting procedure is valid when 
the model potential is chosen small enough outside A that adequate sampling of this area of 
configuration space is achieved. 

Re-weighting is also necessary for calculating the shell occupancy observation numbers 
Xj{k) (for the Bayesian analysis of Section [In|) from model potential simulations. Assum- 
ing Nj independent samples satifsying the HS condition have been collected from a model 
potential simulation mimicking a hard sphere of radius Rj, the reweighting is given by 



xj{k) = N, 





kmin £ (Afc, Afc+l]) 




^min ^ Aj) 



As a check on our mean-field average formula, Eq. ( 124|) . the histogram overlap method 
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can be applied to straight-forwardly estimate /^/igs lr- 




f3AU + In 



PM+Au{AU\rmm > A) 

PM(At/|r^in > A) 



+ In 



<e^^^ Ir^in > A) 



M 



(56) 



{e"^' Kin > A) 



AI+AU 



A direct approach involves employing the BAR method^^ on the interaction energy distribu- 
tions from the model potential simulations. The BAR estimation is enacted on the first two 
terms on the rhs above. The last term serves as the constant correction to the BAR estimate 
of the free energy difference between coupled and reference model particle simulations. We 



of the Gaussian approximation, Eq. (!24l) . This approach can only be employed, of course, 
when some overlap exists between the interaction energy distributions. 

To conclude this section, we outline a strategy for implementing the above theory in 
molecular dynamics simulations. We first perform simulations with the solute included and 
then with no solute (and no model potential). The occupancy counts for the distance of 
closest approach of the solvent to the solute in the first case, and to the cavity center in 
the second case, are assembled. The quantities a;o(A) and Po{X) can then be computed 
out to radii for which good statistics are obtained in those probability distributions. The 
degradation of the statistics at larger radii is easily monitored by plotting the estimated 
variances in addition to the chemical potentials. At the radii where the variances become 
large, model potentials M(r) are inserted to begin to force solvent molecules further from 
the solute. Occupancy observation numbers Xj{k) are assembled in each of the simulations 
for the subsequent Bayesian analysis leading to Xo(A) and Po(A). This process is continued 
until radii are reached at which the Gaussian approximation for the long-ranged contribution 
becomes accurate. 

The same set of simulations of the reference and coupled systems yield estimates of the 
OS long-ranged part of the free energy /S/igs lr- The two terms in Eq. flMl) are estimated via 
Eq. (!54l) and its coupled simulation counterpart. A check of the mean-field approximation 
Eq. fl24|) can be obtained by computing the total fluctuation terms in the first expressions of 
Eqs. (1451) and ( H6l) : the extent of cancellation of these fluctuations monitors how closely the 
energy distributions mimic Gaussian behavior (more appropriately, re- weighted forms of the 
fiucutations can be computed as was done for the mean- field term in Eq. (ISj), see below). 
Alternatively, the long-ranged contribution can be obtained by the BAR method discussed 
above, and the accuracy of the mean-field approach can be assessed directly for cases where 



utilized this approach to calculate /3yUosLR TIP3P water in order to judge the accuracy 
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some overlap of the energy distributions for the reference and coupled simulations occurs. 
Typically, only a few model potential simulations need to be performed during the 'pushing 
out' process in order for this procedure to converge to results close to the exact free energy 
for the cases examined here. Numerical tests of this procedure are presented below. 

V. MOLECULAR DYNAMICS SIMULATIONS 

In order to test the formalism, several solutes ranging from nonpolar to polar to ionic 
were examined. The CH4 and CF4 molecules and the Na"*" and Cl~ ions were each simulated 
in the NPT ensemble using the Gromacs version 3.3.1 code.— The solutes were modeled in 
solvent boxes that included 516 SPC/E waters. The Lennard- Jones potential parameters 
for the various molecules and ions simulated are given in Table Ull and the particle-mesh 
Ewald (PME) method^ was employed to handle the long-ranged electrostatics interactions. 
For reference, the solute/solvent radial distribution functions for each of the above solutes 
and for the TIP3P water case (below) are presented in Fig. HI 

The temperature was taken as 298.15 K and the pressure as 1 bar. Fully coupled systems 
were equilibrated for 2 ns with the Berendsen temperature and pressure-coupling algorithms. 
Then 2 frames per picosecond were recorded during 5 ns of production runs that employed 
the Nose-Hoover thermostat and the Parrinello-Rahman barostat. The OS long-ranged 
free energy contributions were calculated using two simulations corresponding to the two 
sampling distributions indicated in Eqs. |45] and |46] for each value of the solute-to-water 
oxygen exclusion distance A. Simulations including repulsive model potentials were run 
starting from the end of the fully coupled simulations. Nonpolar solutes were equilibrated for 
1 ns, while ions were equilibrated for 0.5 ns followed by production runs of 5 ns (nonpolar) 
or 1 ns (ions). A cubic nonbonded interaction switching function was employed for the 
Lennard- Jones interactions between 10 and 11 A, and no cutoff correction was used for 
these interactions. 

We also computed the solvation free energy of TIP3P water ''^ in water. These simulations 
were conducted since the TIP3P model is widely used in biophysical simulations, and previ- 
ous results have been obtained for comparison. The TIP3P water simulations were carried 
out similar to the above, with 216 waters and a 9 A Lennard- Jones cutoff. Production runs 
were of 4.5 ns duration following equilibration periods of 0.7 ns. Comparison results can be 
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found in Refs. 



80 



-22.6 kJ/mol) and 



8ll (-23.43 kJ/mol). Although these previous studies 
did not use Ewald summation, a study on a different water model^ showed httle difference 
between cut-off based and Ewald-based electrostatic treatment for the solvation free energy 
calculations on water. We note that the free energy results obtained by Paliwal et alM differ 
somewhat since a slightly different TIP3P water model®^ was employed in that work. 

It is important to note that the difference in electrostatic energy between the charged 
(coupled) and uncharged (reference) states necessary to calculate the long-ranged free energy 
contribution for ions includes a net total charge correction defined in Ref. |47|, Eq. 5.50. 
Gromacs version 3.3.1— includes this correction, so the total electrostatic energy difference 



can be calculated from a simple re-run through the trajectory at a different ion charge state. 



VI. RESULTS 



We begin by discussing the free energy results as a function of the IS/OS partitioning 
radius for the various molecular and ionic cases considered here. Due to the relatively small 
size of the systems examined, extensive statistical sampling was performed to obtained fully 
converged results. Below we will examine the issues of convergence rates with simulation 
time and the number of required model potential simulations for statistically reliable results. 
The main focus of these initial free energy results is to assess the accuracy of the QCT free 
energy calculation scheme presented above. 

We start with the OS packing contribution, /ios,Hs(-^)' whose calculation does not depend 
on the solute-solvent interaction energy. That packing contribution was computed from 
simulations of pure water and the reference (model) potential systems used in this study; 
the results are presented in Fig. [51 The occupancy data was processed using the Bayesian 
approach of Section UTTl The model potential M(r) was identical for the CH4, CF4, Na"*", 
and Cl~ systems due to use of the same water model and excluded volume definition. TIP3P 
water requires its own free energy calculation, and the results for this model are increasingly 
lower than the SPC/E results for larger A values (off by 3.7 kJ/mol at 4 A). This discrepancy 
is mostly due to the slightly lower density of TIP3P water at 1 bar. In line with the Gaussian 
information theory estimate,*^ we found that the packing contribution can be estimated 
fairly well from the mean and variance of the occupancy number distribution up to A values 
of around 3.3 A. This method overestimates the solvation free energies for larger A values. 
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however, due to cooperative de-wetting transitions near a cavity.— The exact results start 
at zero for A = and increase to large positive values for large radii. 

Another observation is that the revised scaled-particle theory (SPT)^ expression 

^^'Ss = -l!^^^^ + e - 167ri?7oo5 + ^rrR'^oo + y i?Vt (57) 

accounts almost perfectly for the large-radius solvation behavior of our hard-core solutes, but 
must be carefully parameterized. This theory attempts to model the transition between the 
microscopic and macroscopic limits by least-squares fitting of the above yUQg hsI-^) estimates 
to the SPT model. The SPT model was calculated as a cubic interpolation of the microscopic 
(simulated) and macroscopic (above) expressions between two radii with 700 = 71.99 mN/m 
and neglecting the vapor saturation pressure. Using calculated chemical potentials with radii 
spaced every 0.1 A from 2 to 3 A gave the best fit, with 6 = 0.51 A, A = -3.95, and e = -1.055. 
Other ranges (2.5-3.5 and 3.0-4.0 A) gave positive 6 and A parameters. The Gaussian model 
and the revised SPT theory provide alternate simpler routes to the estimation of Aios,Hs('^) 
where appropriate. We note that our numerically exact results can be calculated out to 
quite large radii due to the inclusion of the model potentials in the simulations. 

The IS chemical potentials were computed for each test solute via the Bayesian method 
of Section llllj the results are presented in Fig. O Because this quantity is based on the 
distribution of the closest solvent atom, the same calculation method as that for the packing 
term was applied. The statistical convergence and accuracy obtained here for the IS free 
energies results from the use of coupled simulation data at multiple Aj values. The free 
energy for the IS term starts at zero for A = 0, remains at zero until nonzero values in the 
solute/solvent radial distribution function are observed, and then steadily decreases to more 
negative values. An interesting observation is that the IS free energies appear to decrease 
nearly linearly at the larger A values studied. This corresponds to an exponential decay in 
P ('"min > A) for large A. 

Since the IS and OS packing contributions start at zero for small radii and move in 
opposite directions with increasing radius, these two contributions tend to cancel to some 
extent. The sum of the IS and OS packing terms can be thought of as the work to carve 
out a cavity of size A followed by the free energy change to then release the nearby waters 
to make direct contact with the solute, so some extent of cancellation of these terms is to 
be expected. For the nonpolar solutes, the magnitude of the packing term is larger than 
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that of the IS term for all A values. The water case is interesting in that the OS packing 
and IS terms cancel at A = 3.2A (where the OS packing excess chemical potential is 21.49 
kJ/mol); thus the OS long-ranged contribution yields the full solvation free energy at that 
radius. Similarly, the Na"*" and Cl^ ion cases display the same cancellation at 2.26 and 3.39 
A, respectively. This behavior is some indication of the differences between nonpolar and 
polar solvation: nonpolar solvation is dominated by packing contributions, while for polar 
solvation the local binding of solvent molecules to the solute molecule overcomes the packing 
penalty. 

The remaining part of the present free energy partitioning scheme is the long-ranged 
contribution /UoslrI-^)- discussed above, the conditioning inherent in this term (due to 
the 'pushing out' of solvent molecules from direct contact with the solute) leads to near- 
Gaussian behavior in the interaction energy distributions, thus suggesting a second-order 
perturbation approach may be accurate. We tested that proposal by computing the bounds 
on /^oslrI-^)' approximate mean value using Eq. flM|) . For the two nonpolar cases 

(CH4 and CF4) the bounds are very close to one another over the whole range of A values 
(Fig. [5]). The OS long-ranged free energy contribution /i^g lr('^) first decreases with increas- 
ing A, exhibits a minimum, and then begins to rise; the value must tend towards zero at 
large radii, but this asymptotic approach is very slow, even for the nonpolar solutes. The 
initial decrease is due to the removal of any near hard-core overlaps in the sampling. The 
free energy bounds are apparent for the smallest radii considered for the CF4 molecule, indi- 
cating the fluctuation terms make a visible contribution for those small radii. Nevertheless, 
the narrowness of the bounds is an indication of why first-order perturbation theory or the 
van der Waals approximation^^ works well for solutes of this type. The total solvation free 
energy computed by summing the IS, OS packing, and long-ranged components from our 
Gaussian estimate is essentially exact for all radii considered. These results clearly illustrate 
the lack of dependence of the QCT free energy on the conditioning radius A.— 

The inclusion of charged interactions for the water and ion cases results in a large increase 
in the separation of the upper and lower bounds for /igs lrI-^)- -^^^ water. Fig. [5] shows that, 
for small exclusion radii, the separation of the bounds is nearly 80 kJ/mol, and decreases 
in magnitude to about 20 kJ/mol for the largest A values considered. Comparison with 
the numerically exact BAR results shows that the mean-field average formula Eq. flM|) 
incurs significant errors up to an exclusion radius of roughly 3.1 A; for larger radii, the 
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Gaussian approximation is nearly exact. The sum of all three components of the free energy 
is correspondingly accurate beyond an exclusion radius of 3.1 A. An indication of the error 
in the Gaussian approximation for the water case is also displayed in Table lllli There we 
list the re-weighted fluctuation parts of the average of the two Gaussian approximations in 
Eqs. ([22D and These fluctuation differences decrease with increasing exclusion radius, 
indicating the excellent accuracy of the Gaussian approximation. In addition, it is apparent 
in Table HIT] that the bound- widening terms Vmm and Wmm are very small in magnitude for 
the model potentials used here. 

The impact of the conditioning on the water interaction energy distributions for the 
long-ranged contribution is shown in Fig. O With no conditioning, the distributions are 
highly non-Gaussian and the interacting and non-interacting cases are widely different. With 
increasing exclusion radius A in the sampling, the distributions become increasingly Gaussian 
with small deviations in the tails. This is the physical justification of our mean-field average 
formula Eq. ( l24|) . 

Last, we discuss the results for the Na"*" and Cl~ ions (Fig. [5l). For these charged cases, no 
overlap of the interaction energy distributions occurs for any exclusion radius examined, and 
the free energy bounds are separated by ^700 kJ/mol for the Na"*" case and ^600 kJ/mol 
for the Cl~ case. Yet a small amount of conditioning results in accurate estimates of the 
total free energy, to within several kJ/mol compared with the numerically exact result; the 
errors range from 1.8 to 5.2 kJ/mol for the Na"*" ion and 0.82 to 7.8 kJ/mol for the Cl~ ion, 
and thus are all within 2 kcal/mol of the exact result. The largest contribution to the total 
free energy is the long-ranged component /iosLRl-^)' comprising 95-99% of the total at the 
chosen radii. These results are remarkable in that two completely non-overlapping limits 
can be used to obtain an accurate free energy estimate with no intermediate sampling. This 
results from the conditioning due to the repulsive particle included in the sampling. 

To give some indication of the efficiency of the spatial conditioning approach described 
here for the IS and OS packing terms, we plot yugs lrW /ifs (A) as a function of A, along 
with the mean plus and minus the variance for simulations for the TIP3P water case. Fig. [7] 
displays the results for a 4.5 ns simulation. From this plot it is clear that good estimates 
of the IS and OS packing contributions can be obtained out to a A value of roughly 3.2 
A using a simulation with no model potential. Inclusion of two additional model potential 
simulations allows for accurate estimation of these contributions out to A = 3.5A, near the 



24 



first minimum in the water-water g{r). In Fig. [TJ we plot the corresponding data for a 200 
ps simulation. It is clear that the reduced sampling leads to a larger variance at smaller 
radius, and two or three simulations are necessary to push accurate results out to a radius of 
about 3.3 A. From simulations of other lengths (100 ps, 500 ps), it is apparent that increased 
simulation time leads to a decrease of the variance at a given radius. From these results, we 
conclude that an accurate estimate of the IS and OS packing contributions for water can 
be obtained from a few model potential simulations over relatively short simulation times 
(roughly 600 ps total). 

For radius values in the range 3.1A< A < 3.5A, the mean-field approximation for the 
long-ranged component discussed in this paper is quite accurate. We also computed this 
quantity for each simulation time to test for convergence in the mean-field estimate of the 
long-ranged part of the free energy (water case). For a A value of 3.2 A, in a 200 ps simulation 
the OS long-ranged free energy contribution is converged to within 0.56 kJ/mol, and for a 
radius of 3.5 A the free energy is converged to within 0.79 kJ/mol relative to the 4.5 ns 
result. Thus we can see that the long-ranged portion of the free energy is well- converged 
during the simulation times required to estimate the IS and packing parts of the free energy. 

VII. CONCLUSIONS AND DISCUSSION 

This paper presents a spatial stratification scheme for accurate and efficient estimation 
of absolute solvation free energies. Quasi-chemical theory leads to a natural division of the 
solvation free energy into inner-shell (IS) and outer-shell (OS) components.— The OS free 
energy can be further divided into packing and long-ranged contributions. In this paper, an 
efficient Bayesian approach was developed to estimate the IS and OS packing contributions 
based entirely on occupancy statistics within shells specified by a hard-sphere cavity size 
A. A method for sampling the occupancy statistics during molecular dynamics simulations 
allows for efficient generation of the free energy profiles as a function of A out to large 
cavity radii. Our spatial partitioning method differs from alternative stratification methods, 
such as those that soften the repulsive portion of the intermolecular potential^ for direct 
TI implementations; the present approach provides a more direct handling of the packing 
issues involved in absolute solvation free energy calculations. 

The long-ranged component of the OS free energy is the free energy of solvation of the 
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solute with solvent excluded from the region specified by A. Thus this free energy results 
from sampling many configurations with energy contributions that have been conditioned to 
disallow penetration of the solvent into direct contact with the solute. The distributions of 
the interaction energies then assume a near-Gaussian form owing to the combined contribu- 
tions of many nearly independent, comparable-magnitude terms.— This observation suggests 
that a second-order expansion of the free energy is accurate for this term; that expectation 
is born out in the computational results for a range of molecular and ionic solute types. 

The free energy calculations reported here require two sets of simulations that include 
continuous repulsive model potentials during the sampling. In the first (reference) set , the 
model potential particle is simulated while embedded in the solvent; in the second (coupled) 
set, the sampling includes the model potential particle with the full solute/solvent interaction 
potential included in addition. Each of the two sets requires only a few simulations to 
generate the free energy profiles as a function of A with high accuracy. The same sets of 
simulations can be used to generate the long-ranged contribution to the free energy. The 
accuracy of that contribution can be monitored by examination of the cancellation of the 
fiuctuation contributions to the free energy; once the reference and coupled estimates of 
the fiuctuations nearly cancel, the mean-field estimate is of high accuracy. Since this free 
energy estimate is the average of two mean-field terms, the calculations converge rapidly with 
simulation time. The estimates are robust because each part of the calculation includes an 
accurate error estimate in addition to the mean value. 

Several solutes were examined in this study, ranging in type from nonpolar to polar 
molecules and then ions. The nonpolar solute cases exhibit narrow bounds on the long- 
ranged contribution, consistent with the fact that a van der Waals approach is appropriate.— 
The bounds on the OS long-ranged free energy contribution are much wider for polar 
molecules and ions, yet with a relatively small degree of conditioning due to inclusion of 
a repulsive model potential, the mean-field estimates are quite accurate for the cases exam- 
ined. For the examples of the Na"*" and Cl~ ions, the distributions of interaction energies 
are completely non-overlapping during the reference and coupled system sampling. Yet the 
spatial conditioning still yields reliable free energies with only the two limits sampled, due 
to the induced near-Gaussian energy distributions. 

The standard particle insertion approach to the PDT excess chemical potential^ relies 
on the availability of accessible cavities in the solvent in which the solute samples favorable 
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solvation environments. Even though this approach is formally exact, solutes of appreciable 
size are not likely to encounter instantaneous cavities of sufficient extent during standard 
molecular dynamics sampling of the solvent. This provides a central motivation to the 
development of the quasi-chemical theory (QCT). An example case where the QCT approach 
should find useful applications is in computing free energies for ions or molecules in highly 
restricted environments; ion channels provide one such example.— Another example might be 
water molecules in specific localized binding sites inside proteins.— The approach developed 
here and elsewhere will allow for accurate pointwise absolute free energy estimates at specific 
locations in inhomogenous systems. 

The long-term goal of the methods developed here is to examine specific ion effects related 
to Hofmeister series that impact on a wide range of chemical and biological phenomena.— 
Specific ion effects are often ascribed to one of two causes, ion size^ and polarizability.— 
The effects tend to be larger for anions due to their higher polarizability and larger size 
relative to cations. The ion size influence on free energies can be relatively easily estimated 
by focusing on the packing and long-ranged contributions in the QCT. 

The effects of anion polarizability (and solvent polarizability) are more challenging. We 
suggest that the OS packing and long-ranged contributions are likely to be relatively accu- 
rately estimated by classical simulations, since for both these cases, the solvent is excluded 
from direct contact with the solute. The IS component is the most-likely candidate for re- 
quiring an accurate quantum mechanical treatment, as has been discussed in a wide range of 
work— that recasts the IS term directly into a chemical equilibrium expression. Limitations 
of that approach for modeling anion free energies have been noted,— however, due to the 
large number of possible ion/water complexes with comparable energies. In anion/water 
cluster calculations, surface solvation states have been observed in addition to a large num- 
ber of near-lying isomers.— The solvation environment in the condensed phase can be 
expected to differ from that in clusters.— "^i^i 

The approach developed in this paper provides a direct route to the IS part of the solva- 
tion free energy. The only information required to compute that contribution is the solvent 
occupancy statistics within the region specifled by the exclusion radius A, and no compu- 
tation of interaction energies is required. Thus we suggest that this approach is well-suited 
for ab initio modeling of anion solvation free energies in the condensed phase. Future work 
will attempt to extend the ideas developed here to those systems, with an aim of unraveling 
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the various contributions to specific ion effects in solvation and ion binding to proteins. 
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TABLE I: Shell occupancy data, Xj{k), for calculating /-*os HS SPC/E water. The hard-sphere 
conditioning radius indexed by j runs along the columns, and the shell index k (from zero to 3.5 
A) runs along the rows. 



Aq" Ai A2 A3 
(0,Ai] 719415 

(Ai,A2] 3607138 8052 

(A2,A3] 831339 1821 8934 

(A3,A3.5] 2602 5 33 7438 

(A3.5,oo) 22 1 64 



Tor ease of exposition, the subscripts correspond to the chosen shell radii (in A). 
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TABLE II: Lennard-Jones force-field parameters 







Cl2 




R 


CH4:0W 


5.873168-10~3 


9.514026-10"*^ 


0.906 


0.385 


CF4:0W 


1.34270763-10"^ 


4.967905-10"^ 


0.907 


0.441 


Na+:OW 


4.3426-10-'' 


2.3523-10"'^ 


0.200 


0.320 


C1-:0W 


6.0106-10-3 


1.6778-10-5 


0.538 


0.421 


SPC/E OW:OW 


2.6173456-10-3 


2.634129-10-*^ 


0.650 


0.355 


TIP3P OW:OW 


2.4889-10-3 


2.4352-10"*^ 


0.636 


0.354 



"Units are kJ/mol for energy and nm for distances. 
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TABLE III: Gaussian approximation error width (kJ/mol) from TIP3P water /^qslr calculation, 
and bound widening terms. 





2"'' OyAcv Conxx-lioti " 


Bound AA'idciuug'' 


2.6 


15.9 


0.03 


2.7 


10.7 


0.06 


2.8 


11.6 


0.07 


2.9 


7.7 


0.08 


3.0 


6.8 


0.08 


3.1 


5.3 


0.09 


3.2 


4.0 


0.10 


3.3 


2.7 


0.09 


3.5 


1.0 


0.76 



2 (W^MM + Vmm) 
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FIGURE CAPTIONS 
Figure 1. Distributions of binding energy of methane to SPC water. The probabihty 
distribution of the interaction energy between a one-site methane solute and SPC water (9 
A LJ cutoff). Note that the overlapping region is sparsely sampled (occupying less than 
0.2% of the uncoupled distribution). The non-interacting data were taken from 2 ■ 10^ 
particle insertions during a 10 ns solvent-only simulation. The fully coupled system data 
were collected from 6000 samples over 20 ns. Two observed single-count outliers at the right 
of this distribution are included. 

Figure 2. This figure illustrates the construction of the probability of closest solvent ap- 
proach, P (rinin). P (rjnin) IS the probability of observing a cavity of size rmin and a solvent 
particle in the shell r^iri '"min + dr^i^. This is equivalent to the conditional probability 
for observing a solvent molecule in the prescribed shell, given the existence of the cavity, 
times the cavity probability. This physical picture is used to derive the SPT expression 
for the hard-sphere chemical potential. The inner circle is the unoccupied cavity, while the 
dashed circle indicates the shell for the observation of solvent closest approach. The solvent 
molecule centers are indicated by the small gray circles. 

Figure 3. Division of the minimum solute-solvent distance (rmin) into successive shells. 
Derived quantities are shown above the axis (from bottom- up): probabilities of rmin falling 
in each shell, Sk, and cavity probabilities pi. The logical organization of simulation data is 
shown below the axis (from top to bottom): bin counts Xj{k) from simulations including 
hard spheres of size Xj, total samples from each simulation, A'^-, and bin totals (k- 
Figure 4. Solute/water oxygen radial distribution functions of all systems studied. Solid 
lines (left scale) show g(R), while the integrated radial distribution functions, n(R), are 
shown as dashed lines (right scale). The RDFs for sodium and chloride can be visually 
distinguished because of their large size difference. There the Cl~ dashed line shows g(R) 
and the dotted line is the corresponding n(R). The subscripts CO, 00, and 10 label the 
carbon/water-oxygen, water-oxygen/water-oxygen, and ion/ water-oxygen distances, respec- 
tively. 

Figure 5. Solvation free energy contributions for the various systems studied. In each 
plot, the solid horizontal line labels the comparison result: test particle insertion for CH4 
(9.3 kJ/mol) and CF4 (15.3 kJ/mol), the result of White and Meirovitcb^ for T1P3P water 
(-23.43 kJ/mol), and test particle insertion plus charging free energy for the ions (-392.5 
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kJ/mol for Na"*" and -370.0 kJ/mol for Cl~). The (x) symbol is for the IS contribution, the 
(+) symbol labels the OS packing component, and the (*) symbol is for the OS long-ranged 
contribution computed from our mean-field average. The bounds on the OS long-ranged 

contribution are indicated by dotted lines in each figure, and the BAR result for TIP3P 
water is given by a line as indicated in the figure. Triangles label the sum of all free energy 
contributions as calculated using the present methodology. 

Figure 6. Effect of conditioning on the interaction energy distribution. The upper left 
figure contains no conditioning. The remaining figures are labeled with the hard-particle 
conditioning radii. TIP3P interaction energy distributions become increasingly Gaussian as 
A increases from zero to 3.5A - evidenced by plotting the distributions corresponding to the 
coupled (x) and uncoupled (+) cases. 

Figure 7. Bayesian estimation of the /xfg and A*os,hs profiles for TIP3P water. Each set of 
lines with the same style shows the mean plus or minus one standard deviation, leading to 
upper and lower estimates of the free energy. In the top panel, /^os.hs calculated using 
all 4.5 ns of data from model potential simulations with radn (1^* branch), 2.7 (2'^'^), and 
3.3 1(3'''^). Similarly /xfg used radii of 0, 3.0, and 3.2 A. In the lower panel, only the last 
200 ps of data from simulations with model potential radii of 0, 2.6, 2.9, and 3.2 A were 
used to calculate /Xqs hs 

and radn of 0, 2.9, 3.0, 3.2, and 3.3 A for /xg. 
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